Abstract

We present Beetle II, a tutorial dialogue system designed to accept
unrestricted language input and support experimentation with different
tutorial planning and dialogue strategies. Our first system evaluation
used two different tutorial policies and demonstrated that the system
can be successfully used to study the impact of different approaches to
tutoring. In the future, the system can also be used to experiment with
a variety of natural language interpretation and generation techniques.

1  Introduction 

Over the last decade there has been considerable interest in developing
tutorial dialogue systems that understand student explanations (Jordan
et al., 2006; Graesser et al., 1999; Aleven et al., 2001; Buckley and
Wolska, 2007; Nielsen et al., 2008; VanLehn et al., 2007), because high
percentages of self-explanation and student contentful talk are known to
be correlated with better learning in human-human tutoring (Chi et al.,
1994; Litman et al., 2009; Purandare and Litman, 2008; Steinhauser et
al., 2007). However, most existing systems use pre-authored tutor
responses for addressing student errors. The advantage of this approach
is that tutors can devise remediation dialogues that are highly tailored
to specific misconceptions many students share, providing step-by-step
scaffolding and potentially suggesting additional problems. The
disadvantage is a lack of adaptivity and generality: students often get
the same remediation for the same error regardless of their past
performance or dialogue context, as it is infeasible to author a
different remediation dialogue for every possible dialogue state. It
also becomes more difficult to experiment with different tutorial
policies within the system due to the inherent complexities in applying
tutoring strategies consistently across a large number of individual
hand-authored remediations.

The Beetle II system architecture is designed to overcome these
limitations (Callaway et al., 2007). It uses a deep parser and
generator, together with a domain reasoner and a diagnoser, to produce
detailed analyses of student utterances and generate feedback
automatically. This allows the system to consistently apply the same
tutorial policy across a range of questions. To some extent, this comes
at the expense of being able to address individual student
misconceptions. However, the system’s modular setup and extensibility
make it a suitable testbed for both computational linguistics algorithms
and more general questions about theories of learning.

A distinguishing feature of the system is that it is based on an
introductory electricity and electronics course developed by experienced
instructional designers. The course was first created for use in a
human-human tutoring study, without taking into account possible
limitations of computer tutoring. The exercises were then transferred
into a computer system with only minor adjustments (e.g., breaking down
compound questions into individual questions). This resulted in a
realistic tutoring setup, which presents interesting challenges to
language processing components, involving a wide variety of language
phenomena.

We demonstrate a version of the system that has undergone a successful
user evaluation in 2009. The evaluation results indicate that additional
improvements to remediation strategies, and especially to strategies
dealing with interpretation problems, are necessary for effective
tutoring. At the same time, the successful large-scale evaluation shows
that Beetle II can be used as a platform for future experimentation.

The rest of this paper discusses the Beetle II system architecture
(Section 2), system evaluation (Section 3), and the range of
computational linguistics problems that can be investigated using
Beetle II (Section 4).

2  System  Architecture 

The Beetle II system delivers basic electricity and electronics tutoring
to students with no prior knowledge of the subject. A screenshot of the
system is shown in Figure 1. The student interface includes an area to
display reading material, a circuit simulator, and a dialogue history
window. All interactions with the system are typed. Students read
pre-authored curriculum slides and carry out exercises which involve
experimenting with the circuit simulator and explaining the observed
behavior. The system also asks some high-level questions, such as “What
is voltage?”.

The system architecture is shown in Figure 2. The system uses a standard
interpretation pipeline, with domain-independent parsing and generation
components supported by domain-specific reasoners for decision making.
The architecture is discussed in detail in the rest of this section.

2.1  Interpretation  Components 

We use the TRIPS dialogue parser (Allen et al., 2007) to parse the
utterances. The parser provides a domain-independent semantic
representation including high-level word senses and semantic role
labels. The contextual interpreter then uses a reference resolution
approach similar to Byron (2002), and an ontology mapping mechanism
(Dzikovska et al., 2008a) to produce a domain-specific semantic
representation of the student’s output. Utterance content is represented
as a set of extracted objects and relations between them. Negation is
supported, together with a heuristic scoping algorithm. The interpreter
also performs basic ellipsis resolution. For example, it can determine
that in the answer to the question “Which bulbs will be on and which
bulbs will be off in this diagram?”, “off” can be taken to mean “all
bulbs in the diagram will be off.” The resulting output is then passed
on to the domain reasoning and diagnosis components.

2.2  Domain  Reasoning  and  Diagnosis 

The system uses a knowledge base implemented in the KM representation
language (Clark and Porter, 1999; Dzikovska et al., 2006) to represent
the state of the world. At present, the knowledge base represents 14
object types and supports the curriculum containing over 200 questions
and 40 different circuits.

Student explanations are checked on two levels, verifying factual and
explanation correctness. For example, for a question “Why is bulb A
lit?”, if the student says “it is in a closed path”, the system checks
two things: a) is the bulb indeed in a closed path? and b) is being in a
closed path a reasonable explanation for the bulb being lit? Different
remediation strategies need to be used depending on whether the student
made a factual error (i.e., they misread the diagram and the bulb is not
in a closed path) or produced an incorrect explanation (i.e., the bulb
is indeed in a closed path, but they failed to mention that a battery
needs to be in the same closed path for the bulb to light).

The knowledge base is used to check the factual correctness of the
answers first, and then a diagnoser checks the explanation correctness.
The diagnoser, based on Dzikovska et al. (2008b), outputs a diagnosis
which consists of lists of correct, contradictory and non-mentioned
objects and relations from the student’s answer. At present, the system
uses a heuristic matching algorithm to classify relations into the
appropriate category, though in the future we may consider a classifier
similar to Nielsen et al. (2008).
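The shape of such a diagnosis can be illustrated with a minimal Python sketch. It is not the Beetle II matching algorithm, whose heuristics are not specified here: the triple encoding of relations, the names `Diagnosis`, `diagnose`, `SINGLE_VALUED` and `EXPECTED`, and the rule that only single-valued relations can contradict the reference answer are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical encoding: a relation is a (name, subject, value) triple.
# Assumption: only relations listed here can hold a single value per subject,
# so a different value for the same subject counts as a contradiction.
SINGLE_VALUED = frozenset({"state"})

# Hypothetical reference answer for "Why is bulb A lit?"
EXPECTED = {
    ("contains", "path1", "bulb_A"),
    ("contains", "path1", "battery"),
    ("state", "path1", "closed"),
}


@dataclass
class Diagnosis:
    correct: list
    contradictory: list
    non_mentioned: list


def diagnose(student, expected):
    """Sort student relations into correct / contradictory / non-mentioned."""
    expected = set(expected)
    correct, contradictory, contradicted = [], [], set()
    for rel in sorted(student):
        if rel in expected:
            correct.append(rel)
            continue
        name, subject, _value = rel
        # Expected relations with the same name and subject but another value.
        clashes = [e for e in expected if e[0] == name and e[1] == subject]
        if name in SINGLE_VALUED and clashes:
            contradictory.append(rel)
            contradicted.update(clashes)
        # Other unmatched student relations are ignored in this sketch.
    non_mentioned = sorted(expected - set(correct) - contradicted)
    return Diagnosis(correct, contradictory, non_mentioned)
```

For the answer “it is in a closed path”, encoded as the two relations about `path1`, the sketch reports both as correct and leaves the battery relation as non-mentioned, which is the case the explanation check above is meant to catch.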

2.3  Tutorial  Planner 

The tutorial planner implements a set of generic tutoring strategies, as
well as a policy to choose an appropriate strategy at each point of the
interaction. It is designed so that different policies can be defined
for the system. The currently implemented strategies are: acknowledging
the correct part of the answer; suggesting a slide to read with
background material; prompting for missing parts of the answer; hinting
(low- and high-specificity); and giving away the answer. Two or more
strategies can be used together if necessary.

Figure 1: Screenshot of the Beetle II system

Figure 2: System architecture diagram

The hint selection mechanism generates hints automatically. For a
low-specificity hint it selects an as-yet unmentioned object and hints
at it, for example, “Here’s a hint: Your answer should mention a
battery.” For a high-specificity hint, it attempts to hint at a
two-place relation, for example, “Here’s a hint: the battery is
connected to something.”
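The two hint levels can be sketched in Python as follows. This is an illustrative assumption rather than the system's generator: the function `select_hint`, the `PHRASES` table, the triple encoding of relations, and the choice to return `None` when nothing suitable remains are all hypothetical. Only the two example hint texts come from the description above.

```python
# Hypothetical surface phrases for two-place relations.
PHRASES = {"connected_to": "connected to"}


def select_hint(unmentioned_objects, unmentioned_relations, specificity):
    """Build a hint from the parts of the answer not yet mentioned.

    unmentioned_objects: list of object names, e.g. ["battery"]
    unmentioned_relations: list of (name, subject, other) triples
    specificity: "low" or "high"
    Returns the hint text, or None if no suitable part is available
    (the fallback behaviour is an assumption of this sketch).
    """
    if specificity == "low" and unmentioned_objects:
        obj = unmentioned_objects[0]
        return f"Here's a hint: Your answer should mention a {obj}."
    if specificity == "high" and unmentioned_relations:
        name, subject, _withheld = unmentioned_relations[0]
        # The second argument is withheld so the hint does not give away
        # the full relation.
        return f"Here's a hint: the {subject} is {PHRASES[name]} something."
    return None
```

Withholding the second argument of the relation is what keeps the high-specificity hint short of giving away the answer.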

The tutorial policy makes a high-level decision as to which strategy to
use (for example, “acknowledge the correct part and give a
high-specificity hint”) based on the answer analysis and dialogue
context. At present, the system takes into consideration the number of
incorrect answers received in response to the current question and the
number of uninterpretable answers.1
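A policy of this kind can be sketched as a small decision function. The thresholds, the escalation order, and the strategy labels below are assumptions chosen for illustration; the text above states only which inputs the policy considers, not how they are combined.

```python
def choose_strategies(has_correct_part, n_incorrect, n_uninterpretable):
    """Pick tutoring strategies for an answer that is not yet complete.

    has_correct_part: whether the diagnosis found any correct content
    n_incorrect, n_uninterpretable: earlier failed attempts at the
    current question. Escalating on their sum is an assumption of this
    sketch, as are the cut-off values.
    """
    strategies = []
    if has_correct_part:
        strategies.append("acknowledge_correct_part")
    failed_attempts = n_incorrect + n_uninterpretable
    if failed_attempts == 0:
        strategies.append("prompt_for_missing")
    elif failed_attempts == 1:
        strategies.append("hint_low_specificity")
    elif failed_attempts == 2:
        strategies.append("hint_high_specificity")
    else:
        strategies.append("give_away_answer")
    return strategies
```

Returning a list reflects the point made earlier that two or more strategies can be used together, for example acknowledging the correct part and then hinting.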

In addition to a remediation policy, the tutorial planner implements an
error recovery policy (Dzikovska et al., 2009). Since the system accepts
unrestricted input, interpretation errors are unavoidable. Our recovery
policy is modeled on the TargetedHelp (Hockey et al., 2003) policy used
in task-oriented dialogue. If the system cannot find an interpretation
for an utterance, it attempts to produce a message that describes the
problem but without giving away the answer, for example, “I’m sorry. I’m
having a problem understanding. I don’t know the word power.” The help
message is accompanied with a hint at the appropriate level, also
depending on the number of previous incorrect and non-interpretable
answers.
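For the unknown-word case in the example message, a targeted help response might be assembled roughly as in the sketch below. The function `targeted_help`, the set-based lexicon lookup, and the tokenisation are assumptions; in the actual system the diagnosis of the interpretation failure would come from the parser and interpreter rather than from a word list.

```python
import re


def targeted_help(utterance, lexicon):
    """Describe an interpretation problem without revealing the answer.

    lexicon: set of lower-case words the system knows (hypothetical).
    Only the unknown-word case from the example message is covered.
    """
    words = re.findall(r"[a-z']+", utterance.lower())
    unknown = [w for w in words if w not in lexicon]
    message = "I'm sorry. I'm having a problem understanding."
    if unknown:
        # Report only the first unknown word to keep the message short.
        message += f" I don't know the word {unknown[0]}."
    return message
```

A hint chosen at the appropriate level would then be appended to this message, as described above.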

2.4  Generation 

The strategy decision made by the tutorial planner, together with
relevant semantic content from the student’s answer (e.g., part of the
answer to confirm), is passed to content planning and generation. The
system uses a domain-specific content planner to produce input to the
surface realizer based on the strategy decision, and a FUF/SURGE
(Elhadad and Robin, 1992) generation system to produce the appropriate
text. Templates are used to generate some stock phrases such as “When
you are ready, go on to the next slide.”

2.5  Dialogue  Management 

Interaction between components is coordinated by the dialogue manager
which uses the information-state approach (Larsson and Traum, 2000). The
dialogue state is represented by a cumulative answer analysis which
tracks, over multiple turns, the correct, incorrect, and
not-yet-mentioned parts of the answer. Once the complete answer has been
accumulated, the system accepts it and moves on. Tutor hints can
contribute parts of the answer to the cumulative state as well, allowing
the system to jointly construct the solution with the student.

1 Other factors such as student confidence could be considered as well
(Callaway et al., 2007).
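The cumulative answer analysis kept by the dialogue manager can be pictured with the following Python sketch. The class `CumulativeAnswer`, its method names, and the use of plain string labels for answer parts are hypothetical; the sketch only mirrors the bookkeeping described in this section.

```python
class CumulativeAnswer:
    """Tracks answer parts for one question across several turns."""

    def __init__(self, expected):
        self.expected = set(expected)   # parts of the complete answer
        self.correct = set()            # expected parts supplied so far
        self.incorrect = set()          # contradictory parts seen so far

    def add_student_turn(self, correct, contradictory):
        # Only parts belonging to the expected answer count as progress.
        self.correct |= set(correct) & self.expected
        self.incorrect |= set(contradictory)

    def add_tutor_hint(self, parts):
        # Hints contribute to the shared solution as well.
        self.correct |= set(parts) & self.expected

    @property
    def not_yet_mentioned(self):
        return self.expected - self.correct

    def is_complete(self):
        return not self.not_yet_mentioned
```

For instance, if the student supplies two of three expected parts and a tutor hint supplies the third, `is_complete()` becomes true and the system can accept the answer and move on.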

3  Evaluation 

The first experimental evaluation involving 81 participants
(undergraduates recruited from a Southeastern University in the USA) was
completed in 2009. Participants had little or no prior knowledge of the
domain. Each participant took a pre-test, worked through a lesson with
the system, took a post-test, and completed a user satisfaction survey.
Each session lasted approximately 4 hours.

We implemented two different tutoring policies in the system for this
evaluation. The baseline policy used an “accept and bottom out” strategy
for all student answers, regardless of their content. The students were
always given the correct answer, but the system made no attempt at
remediation, and never indicated whether the student was understood (or
correct). In comparison, the full adaptive policy attempted to select an
appropriate strategy based on student answer analysis and dialogue
context, as described above. Students were randomly assigned to a
condition. On average, each student session contained 232 student
language turns in the fully adaptive condition (SD = 25.6) and 156 turns
in the baseline condition (SD = 2.02). Additional time was taken by
reading and interacting with the simulation environment.

Out of 81 students, 76 successfully completed the evaluation (data from
5 participants had to be discarded due to system crashes). There was no
significant difference in learning gain between the two conditions
(baseline: M = 0.63, SD = 0.26; full: M = 0.57, SD = 0.23; t-test,
p = 0.32), and students gave higher satisfaction scores to the baseline
condition (baseline: M = 3.32, SD = 0.65; full: M = 2.56, SD = 0.65;
t-test, p < 0.01). We are currently analyzing the data to investigate
the reasons for this outcome. However, the evaluation demonstrated that
the system can be successfully deployed in experiments with naive users
and therefore can be used to study the impact of different natural
language processing algorithms on learning gain and user satisfaction.

An  example  dialogue  with  the  system,  taken 
from  the  evaluation  corpus,  is  shown  in  Figure  3. 
Tutor:    Why was bulb A on when switch Y was open and switch Z was
          closed?
Student:  because it had a closed path
Tutor:    Right. There is a closed path. Keep going.
Student:  there was a closed path not effected by the switch
Tutor:    I’m sorry. I’m having a problem understanding. I don’t
          understand the word ‘effect’ here. Here’s a hint. Your answer
          should mention a battery.
Student:  it has a closed path with the battery
Tutor:    You’re on the right track. Here’s the answer. Bulb A is still
          contained in a closed path with the battery.

Figure 3: Example interaction with the system from our corpus

The dialogue shows three key system properties. After the student’s
first turn, the system rephrases its understanding of the correct part
of the student answer and prompts the student to supply the missing
information. In the second turn, the student utterance could not be
interpreted and the system responds with a targeted help message and a
hint about the object that needs to be mentioned. Finally, in the last
turn the system combines the information from the tutor’s hint and the
student’s answers and restates the complete answer, since the current
answer was completed over multiple turns.

4  Conclusions  and  Future  Work 

The Beetle II system we present was built to serve as a platform for
research in computational linguistics and tutoring, and can be used for
task-based evaluation of algorithms developed for other domains. We are
currently developing an annotation scheme for the data we collected to
identify student paraphrases of correct answers. The annotated data will
be used to evaluate the accuracy of existing paraphrasing and textual
entailment approaches and to investigate how to combine such algorithms
with the current deep linguistic analysis to improve system robustness.
We also plan to annotate the data we collected for evidence of
misunderstandings, i.e., situations where the system arrived at an
incorrect interpretation of a student utterance and took action on it.
Such annotation can provide useful input for statistical learning
algorithms to detect and recover from misunderstandings.

In dialogue management and generation, the key issue we are planning to
investigate is that of linguistic alignment. The analysis of the data we
have collected indicates that student satisfaction may be affected if
the system rephrases student answers using different words (for example,
using better terminology) but doesn’t explicitly explain the reason why
different terminology is needed (Dzikovska et al., 2010). Results from
other systems show that measures of semantic coherence between a student
and a system were positively associated with higher learning gain (Ward
and Litman, 2006). Using a deep generator to automatically generate
system feedback gives us a level of control over the output and will
allow us to devise experiments to study those issues in more detail.

From the point of view of tutoring research, we are planning to use the
system to answer questions about the effectiveness of different
approaches to tutoring, and the differences between human-human and
human-computer tutoring. Previous comparisons of human-human and
human-computer dialogue were limited to systems that asked short-answer
questions (Litman et al., 2006; Rosé and Torrey, 2005). Having a system
that allows more unrestricted language input will provide a more
balanced comparison. We are also planning experiments that will allow us
to evaluate the effectiveness of individual strategies implemented in
the system by comparing system versions using different tutoring
policies.